In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
def plot_dist(dist, xmin, xmax, ax):
    """Plot the pdf of a scipy distribution on [xmin, xmax]."""
    x = np.linspace(xmin, xmax, 1000)
    y = dist.pdf(x)
    ax.plot(x, y)
    ax.set_ylim(0, y.max() + y.max() / 4.0)

Chapter 2 - Probability

This chapter introduces probability theory (and the differences between frequentists and Bayesians), some common statistics, and examples of discrete and continuous distributions. It also presents transformation of variables, Monte Carlo methods and information theory.

Rules of probability

Sum rule

The probability of the disjunction of two events (or assertions), i.e. that $A$ or $B$ occurs, is given by:

$$p(A\ or\ B) = p(A) + p(B) - p(A\ and\ B)$$

Product rule

The probability that both events $A$ and $B$ occur is given by:

$$p(A\ and\ B) = p(A|B)p(B)$$

Conditional

The probability of the event $A$ given that the event $B$ is true is given by:

$$p(A|B) = \frac{p(A\ and\ B)}{p(B)}$$

Bayes rule

The Bayes rule is:

$$p(X|Y) = \frac{p(X)p(Y|X)}{p(Y)}$$

This can be derived from the sum and the product rules: applying the product rule in both orders gives $p(X|Y)p(Y) = p(X, Y) = p(Y|X)p(X)$, and the denominator can be expanded with the sum rule as $p(Y) = \sum_x p(Y|X=x)\,p(X=x)$.
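As a small worked example, Bayes' rule combined with this expansion of $p(Y)$ can be evaluated directly in Python; the prior and likelihood values below are made up for illustration:

In [ ]:
# Hypothetical numbers: p_x is the prior p(X=1), p_y_given_x the likelihood p(Y=1 | X=1)
p_x = 0.01
p_y_given_x = 0.9
p_y_given_not_x = 0.05

# evidence p(Y=1), expanded with the sum and product rules
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# posterior p(X=1 | Y=1) via Bayes' rule
p_x_given_y = p_x * p_y_given_x / p_y
print(p_x_given_y)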

Independence

The events $A$ and $B$ are independent if:

$$p(A\ and\ B) = p(A)p(B)$$

Note that this is basically the product rule, with the condition that $p(A|B) = p(A)$.

Continuous random variables

To deal with continuous random variables we define the cumulative distribution function (cdf) $F(x) = p(X\leq x)$. This means that:

$$p(a < X \leq b) = F(b) - F(a)$$

Common Statistics

Quantiles

The $\alpha$ quantile of a cdf $F$, denoted $F^{-1}(\alpha)$, is the value $x_\alpha$ such that

$$F(x_\alpha) = P(X \leq x_\alpha) = \alpha$$

The value $F^{-1}(0.5)$ is the median of the distribution.

Mean

The mean or expected value of a discrete distribution, commonly denoted by $\mu$, is defined as

$$\mathbb{E}[X] = \sum_{x\in\mathcal{X}}x~p(x)$$

Whereas for a continuous distribution, the mean is defined as

$$\mathbb{E}[X] = \int_{\mathcal{X}}x~p(x)~\mathrm{d}x$$

Variance

The variance, denoted by $\sigma^2$, is a measure of the "spread" of a distribution, defined as

$$\sigma^2 = \mathrm{var}[X] = \mathbb{E}\left[(X-\mu)^2\right]$$

where $\mu = \mathbb{E}[X]$. A useful result is

$$\mathbb{E}[X^2] = \sigma^2 + \mu^2$$

The standard deviation is defined as $\mathrm{std}[X] = \sqrt{\sigma^2} = \sigma$
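A quick numerical check of the identity $\mathbb{E}[X^2] = \sigma^2 + \mu^2$, using samples from a gamma distribution chosen only for illustration:

In [ ]:
samples = stats.gamma(a=3).rvs(size=100000)

mu = samples.mean()
sigma2 = samples.var()

# E[X^2] should be close to sigma^2 + mu^2
print(np.mean(samples**2), sigma2 + mu**2)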

Covariance and correlation

The covariance between two random variables measures the degree to which they are linearly related:

$$\mathrm{cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

If $\mathbf{x}$ is a $d$-dimensional random vector, its covariance matrix is given by:

$$\mathrm{cov}[\mathbf{x}] = \mathbb{E}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^\intercal\right] = \begin{bmatrix}\mathrm{var}[X_1] & \mathrm{cov}[X_1, X_2] & \dots & \mathrm{cov}[X_1, X_d] \\ \mathrm{cov}[X_2, X_1] & \mathrm{var}[X_2] & \dots & \mathrm{cov}[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}[X_d, X_1] & \mathrm{cov}[X_d, X_2] & \dots & \mathrm{var}[X_d] \\ \end{bmatrix}$$

The Pearson correlation coefficient between two random variables $X$ and $Y$ is given by:

$$\mathrm{corr}\left[X, Y\right] = \frac{\mathrm{cov}\left[X, Y\right]}{\sqrt{\mathrm{var}[X]\mathrm{var}[Y]}}$$
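Both quantities can be estimated from samples with numpy; in the sketch below the linear relationship between the two variables is made up for illustration:

In [ ]:
x = np.random.randn(10000)
y = 2 * x + np.random.randn(10000)   # linearly related to x, plus noise

print(np.cov(x, y))        # 2x2 sample covariance matrix
print(np.corrcoef(x, y))   # Pearson correlation (close to 2/sqrt(5) here)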

Common discrete distributions

  • Bernoulli:
$$\mathrm{Ber}(x~|~\theta) = \left\{ \begin{array}{ll} \theta & \mbox{if } x = 1 \\ 1-\theta & \mbox{if } x = 0 \end{array} \right. $$
  • Binomial:
$$\mathrm{Bin}(k~|~n,\theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$$
  • Multinomial:
$$\mathrm{Mu}(x~|~n,\theta) = \binom{n}{x_1,...,x_K}\prod_{j=1}^{K}\theta_{j}^{x_j}$$

The multinomial coefficient is defined as

$$\binom{n}{x_1,...,x_K} = \frac{n!}{x_1! \dots x_K!}$$
  • Poisson
$$\mathrm{Poi}(x~|~\lambda) = e^{-\lambda}\frac{\lambda^x}{x!}$$
  • Empirical distribution

Given a dataset $\mathcal{D} = \{x_1, \dots, x_N\}$, the empirical distribution is defined as

$$p_{\mathrm{emp}}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A)$$

More generally, each point can be given a weight, $p_{\mathrm{emp}}(A) = \sum_{i=1}^{N}w_i~\delta_{x_i}(A)$, where $0\leq w_i \leq 1$ and $\sum_i w_i = 1$; the uniform weights $w_i = 1/N$ recover the definition above.
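The standard pmfs above are available in scipy.stats; a minimal sketch evaluating a few of them (the parameter values are arbitrary):

In [ ]:
print(stats.bernoulli(0.3).pmf([0, 1]))                      # Ber(x | theta=0.3)
print(stats.binom(10, 0.3).pmf(3))                           # Bin(3 | n=10, theta=0.3)
print(stats.poisson(2.5).pmf(np.arange(5)))                  # Poi(x | lambda=2.5)
print(stats.multinomial(5, [0.2, 0.3, 0.5]).pmf([1, 1, 3]))  # Mu(x | n=5, theta)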

Common continuous distributions

  • Normal $$ P(x~|~\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}~e^{-\frac{1}{2\sigma^2}(x-\mu)^2} $$

In [3]:
ax = plt.subplot(111)
plot_dist(stats.norm, -4, 4, ax)


  • Laplace $$ P(x~|~\mu, b) = \frac{1}{2b}~\exp\left(-\frac{|x-\mu|}{b}\right)$$

In [4]:
ax = plt.subplot(111)
plot_dist(stats.laplace, -6, 6, ax)


  • Gamma $$ P(T~|~\text{shape}=a,\text{rate}=b) = \frac{b^a}{\Gamma(a)}~T^{a-1}e^{-Tb}$$

Where $$\Gamma(x) = \int_{0}^{\infty}u^{x-1}e^{-u}~\mathrm{d}u$$


In [5]:
ax = plt.subplot(111)
# Ga(shape=1, rate=2); scipy's gamma takes a shape a and a scale = 1/rate
plot_dist(stats.gamma(a=1, scale=1/2.0), 0, 6, ax)


  • Exponential $$P(x~|~\lambda) = \lambda e^{-\lambda x}$$
  • Chi-squared
$$P(x~|~\nu) = \mathrm{Gamma}\left(x~\Bigg|~\frac{\nu}{2}, \frac{1}{2}\right)$$
  • Beta $$P(x~|~a,b) = \frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1}$$

Where $$B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$

  • Pareto $$P(x~|~k,m) = km^kx^{-(k+1)}\mathbb{I}(x\geq m)$$

  • Student t $$P(x~|~\nu) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$

where $\nu$ is the number of degrees of freedom
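The distributions listed above without plots can be visualized with the same plot_dist helper; a brief sketch plotting three of them, with arbitrary parameter choices:

In [ ]:
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
plot_dist(stats.expon(scale=1.0), 0, 5, axes[0])   # Exponential, lambda = 1
plot_dist(stats.beta(2, 5), 0, 1, axes[1])         # Beta(a=2, b=5)
plot_dist(stats.t(df=3), -6, 6, axes[2])           # Student t, nu = 3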

Joint probability distributions

  • Multivariate Normal $$P(\mathbf{x}~|~\mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2} \left|\mathbf{\Sigma}\right|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{\mu})^\intercal \Sigma^{-1}(\mathbf{x} - \mathbf{\mu})\right]$$

where $\mathbf{\mu} = \mathbb{E}[\mathbf{x}]$ is the mean and $\mathbf{\Sigma} = \mathrm{cov}[\mathbf{x}]$ is the covariance matrix.


In [6]:
x = np.arange(-4.0, 4.0, 0.1)
y = np.arange(-3.0, 3.0, 0.1)
X, Y = np.meshgrid(x, y)

# bivariate normal with unit variances and covariance 0.7
rv = stats.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.7], [0.7, 1.0]])
Z = rv.pdf(np.dstack((X, Y)))

plt.contour(X, Y, Z);


  • Multivariate Student t $$P(\mathbf{x}~|~\mathbf{\mu}, \mathbf{\Sigma}, \nu) = \frac{\Gamma(\nu/2 + D/2)}{\Gamma(\nu/2)}~\big|\pi \mathbf{V}\big|^{-1/2} \times \left[1 + (\mathbf{\mathbf{x}} - \mathbf{\mu})^\intercal \mathbf{V}^{-1}(\mathbf{\mathbf{x}} - \mathbf{\mu})\right]^{-\frac{\nu+D}{2}}$$

Where $\mathbf{\Sigma}$ is called the scale matrix, $\nu$ is a scalar that controls how fat the tails of the distribution are (the larger it is, the thinner the tails, and the distribution approaches the multivariate normal), and $\mathbf{V} = \nu\mathbf{\Sigma}$. This distribution has the following properties: $\mathbf{\mu}$ is the mean, the median and the mode, and the covariance is $\frac{\nu}{\nu - 2}\mathbf{\Sigma}$.

  • Dirichlet $$P(\mathbf{x}~|~\mathbf{\alpha}) = \frac{1}{B(\mathbf{\alpha})} \prod_{k=1}^{K} x_k^{\alpha_k - 1}~ \mathbb{I}(\mathbf{x} \in S_K)$$

where $S_K = \left\{\mathbf{x}: 0 \leq x_k \leq 1, \sum_{k=1}^{K} x_k = 1\right\}$ is the probability simplex (the support of the distribution) and $B(\mathbf{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$ is the beta function generalized to $K$ variables.
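Samples from a Dirichlet lie on the probability simplex, i.e. they are non-negative and sum to one; a minimal check with numpy (the concentration parameters are arbitrary):

In [ ]:
alpha = [1.0, 2.0, 3.0]
samples = np.random.dirichlet(alpha, size=5)

print(samples)
print(samples.sum(axis=1))   # each row sums to 1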

Central limit theorem

Consider $N$ independent and identically distributed random variables with pdf $p(x_i)$ (not necessarily Gaussian), each with mean $\mu$ and variance $\sigma^2$. Let $S_N = \sum_{i=1}^{N} X_i$ be the sum of these random variables. The central limit theorem states that, as $N$ grows, the distribution of this sum approaches

$$P(S_N = s) = \frac{1}{\sqrt{2\pi N\sigma^2}}\exp\left(-\frac{(s - N\mu)^2}{2N\sigma^2}\right)$$

That is, the distribution of

$$Z_N = \frac{S_N - N\mu}{\sigma\sqrt{N}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}}$$

where $\bar{X} = S_N / N$ is the empirical mean, converges to the standard normal $\mathcal{N}(0, 1)$.
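A quick illustration: standardized sums of uniform (hence non-Gaussian) samples look approximately standard normal. The choices of $N$ and of the number of sums below are arbitrary:

In [ ]:
N = 30            # number of variables per sum
num_sums = 10000  # number of sums to draw

# uniform on [0, 1]: mu = 0.5, sigma^2 = 1/12
x = np.random.uniform(size=(num_sums, N))
s = x.sum(axis=1)
z = (s - N * 0.5) / (np.sqrt(N) * np.sqrt(1 / 12))

ax = plt.subplot(111)
ax.hist(z, bins=50, density=True)
plot_dist(stats.norm, -4, 4, ax)   # compare against the standard normal pdf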

Transformation of random variables

Linear transformations

If $\mathbf{y} = f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$ then:

$$\mathbb{E}[\mathbf{y}] = \mathbf{A}\mathbf{\mu} + \mathbf{b}$$

$$\mathrm{cov}[\mathbf{y}] = \mathbf{A} \mathbf{\Sigma} \mathbf{A}^\intercal$$

where $\mathbf{\mu} = \mathbb{E}[\mathbf{x}]$ and $\mathbf{\Sigma} = \mathrm{cov}[\mathbf{x}]$.
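A quick numerical check of these identities with samples from a multivariate normal; the particular $\mathbf{A}$, $\mathbf{b}$, $\mathbf{\mu}$ and $\mathbf{\Sigma}$ below are arbitrary choices for illustration:

In [ ]:
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -0.5])

x = np.random.multivariate_normal(mu, Sigma, size=100000)
y = x @ A.T + b

print(y.mean(axis=0), A @ mu + b)   # E[y] vs A mu + b
print(np.cov(y.T))                  # sample covariance of y ...
print(A @ Sigma @ A.T)              # ... should be close to A Sigma A^T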

General transformations

Discrete random variable

If $X$ is a discrete random variable, we can derive the pmf of $y = f(x)$ by summing up the probability mass for all $x$ such that $f(x) = y$:

$$P(y) = \sum_{x:~f(x)=y} P(x)$$

Continuous random variable

If $X$ is continuous we work with the cdf instead. Assuming $f$ is monotonically increasing (and hence invertible):

$$P_y(y) = P(f(X) \leq y) = P(X \leq f^{-1}(y)) = P_x(f^{-1}(y))$$

Taking the derivatives we get:

$$p_y(y) = p_x(x) \left|\frac{dx}{dy}\right|$$
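For example, if $X \sim \mathcal{N}(0,1)$ and $y = f(x) = e^x$, then $x = \ln y$ and $\left|\frac{dx}{dy}\right| = \frac{1}{y}$; the sketch below compares this analytic density with a Monte Carlo histogram:

In [ ]:
x = np.random.randn(100000)
y = np.exp(x)

grid = np.linspace(0.01, 6, 500)
# change of variables: p_y(y) = p_x(ln y) * |d(ln y)/dy| = p_x(ln y) / y
p_y = stats.norm.pdf(np.log(grid)) / grid

plt.hist(y, bins=200, range=(0, 6), density=True)
plt.plot(grid, p_y);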

Multivariate transformation

If $X$ is a multivariate continuous random variable we get:

$$p_y(y) = p_x(x) \left|\mathrm{det} ~ \mathbf{J}_{\mathbf{y} \rightarrow \mathbf{x}}\right|$$

Where $\mathbf{J}_{\mathbf{y} \rightarrow \mathbf{x}} = \frac{\partial \mathbf{x}}{\partial \mathbf{y}}$ is the Jacobian matrix.

Monte Carlo Methods

Using Monte Carlo to compute distribution of a transformed variable

First generate $S$ samples from the distribution of $X$, call them $x_1, \dots, x_S$. We approximate the distribution of $f(X)$ by the empirical distribution of $\{f(x_s)\}_{s=1}^{S}$. To compute the expected value of any function of a random variable we use:

$$\mathbb{E}[f(X)] = \int f(x)p(x)dx \approx \frac{1}{S}\sum_{s=1}^{S} f(x_s)$$

By varying the function $f$ we can approximate quantities such as the mean, variance, cdf and median of a variable.
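For instance, if $X \sim \mathcal{N}(0,1)$ then $\mathbb{E}[X^2] = 1$ and the median of $e^X$ is also $1$; a minimal Monte Carlo check:

In [ ]:
S = 100000
x = np.random.randn(S)

print(np.mean(x**2))         # approximates E[X^2] = 1
print(np.median(np.exp(x)))  # approximates the median of exp(X), also 1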

Using Monte Carlo to estimate $\pi$

We know that the area of a circle is given by $\pi r^2$ but it is also given by:

$$A = \int_{-r}^r \int_{-r}^r \mathbb{I}(x^2 + y^2 \leq r^2) dx dy$$

therefore $\pi = \frac{A}{r^2}$. To use Monte Carlo to estimate $A$, we take $p(x)$ and $p(y)$ to be uniform distributions on $[-r,r]$, so that $p(x) = p(y) = \frac{1}{2r}$, and $f(x,y)$ to be $\mathbb{I}(x^2 + y^2 \leq r^2)$. Then, using the Monte Carlo approximation:

$$A = 4r^2 \int_{-r}^r \int_{-r}^r \mathbb{I}(x^2 + y^2 \leq r^2)~ p(x)~ p(y)~ dx~ dy \approx 4r^2 \frac{1}{S} \sum_{s=1}^{S} f(x_s, y_s)$$

In [7]:
def plot_circle(radius):
    fig, ax = plt.subplots(figsize=(5,5))
    ax.set_xlim(-radius, radius)
    ax.set_ylim(-radius, radius)
    circle = plt.Circle((0,0), radius, color='k', fill=False)
    ax.add_artist(circle)
    return ax

def in_circle(point, radius):
    x, y = point
    return x**2 + y**2 <= radius**2
    

def sample_points(a, b, n):
    for i in range(n):
        x = np.random.uniform(a, b)
        y = np.random.uniform(a, b)
        yield x, y

def compute_pi(n_points, radius=1):
    # The fraction of sampled points that land inside the circle estimates
    # the area ratio pi*r^2 / (2r)^2 = pi/4.
    ax = plot_circle(radius)
    count = 0
    for point in sample_points(-radius, radius, n_points):
        color = 'b'
        if in_circle(point, radius):
            count += 1
            color = 'r'
        ax.scatter(point[0], point[1], color=color)
    return 4 * count / n_points

pi = compute_pi(10000)
print("The value of pi is:", pi)


The value of pi is: 3.1524

Monte Carlo accuracy

Using the central limit theorem, one can show that:

$$(\hat{\mu} - \mu) \rightarrow \mathcal{N}\left(0, \frac{\sigma^2}{S}\right)$$

Where $\sigma^2 = \mathrm{var}[f(X)]$ can be estimated using:

$$\hat{\sigma}^2 = \frac{1}{S} \sum_{s=1}^{S} \left(f(x_s) - \hat{\mu}\right)^2$$

Then we have:

$$P\left\{\mu - 1.96\frac{\hat{\sigma}}{\sqrt{S}} \leq \hat{\mu} \leq \mu + 1.96\frac{\hat{\sigma}}{\sqrt{S}}\right\} \approx 0.95$$

That is, with probability approximately $95\%$, the Monte Carlo estimate $\hat{\mu}$ lies within $1.96\frac{\hat{\sigma}}{\sqrt{S}}$ of the true value $\mu$. The number of samples $S$ controls this tolerance: the larger $S$, the smaller the error, but the more expensive the approximation becomes.
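A sketch of this error bar for a Monte Carlo estimate of $\mathbb{E}[X^2]$ with $X \sim \mathcal{N}(0,1)$ (the sample size below is arbitrary):

In [ ]:
S = 10000
x = np.random.randn(S)
f = x**2                       # true mean is E[X^2] = 1

mu_hat = f.mean()
sigma_hat = f.std()
tol = 1.96 * sigma_hat / np.sqrt(S)

print("estimate: %.4f, 95%% interval: [%.4f, %.4f]" % (mu_hat, mu_hat - tol, mu_hat + tol))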

Information Theory

Although information theory is concerned with representing data in a compact fashion and transmitting it in a robust manner, it has a lot to do with probability, so we introduce a few of its most useful concepts.

Entropy

The entropy of a random variable $X$ is given by:

$$\mathbb{H}(X) = -\sum_{k=1}^{K} p(k) \log p(k)$$

measured in bits if we use log base 2 and in nats if we use log base $e$. This measures the amount of uncertainty we have about the random variable $X$.
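For a Bernoulli variable the entropy is maximal at $\theta = 0.5$, where it equals one bit; a minimal check using scipy.stats.entropy with log base 2:

In [ ]:
for theta in [0.1, 0.3, 0.5]:
    p = [theta, 1 - theta]
    print(theta, stats.entropy(p, base=2))   # entropy in bits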

KL divergence

A way to measure the similarity between two distributions $p$ and $q$ is to use the KL-Divergence (or relative entropy) defined as follows:

$$\mathbb{KL}[p\|q] = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} = - \mathbb{H}(p) + \mathbb{H}(p,q)$$

Where $\mathbb{H}(p,q) = - \sum_k p_k \log q_k$ is the cross entropy, which can be interpreted as the average number of bits needed to encode data coming from a source with distribution $p$ when we use a model $q$ to define our codebook. Therefore the KL divergence can be interpreted as the extra number of bits needed to encode the data due to using distribution $q$ instead of $p$. The KL divergence is always non-negative, and is zero if and only if $p=q$.
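scipy.stats.entropy also computes the KL divergence when given two distributions; a small check with two made-up pmfs, which also shows that the divergence is not symmetric:

In [ ]:
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(stats.entropy(p, q))   # KL[p || q], in nats
print(stats.entropy(q, p))   # KL[q || p], generally different
print(stats.entropy(p, p))   # zero when the distributions are equal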

Mutual information

To measure how much knowing a random variable $X$ tells us about another one, $Y$, we could use the correlation or covariance, but these only work for real-valued random variables and only capture linear relations. A more general approach is to use the mutual information:

$$I(X;Y) = \mathbb{KL}[P(X,Y)\|P(X)P(Y)]$$

which is zero if and only if the variables are independent. This can be rewritten as:

$$I(X;Y) = \mathbb{H}(X) - \mathbb{H}(X|Y)$$

where $\mathbb{H}(X|Y) = \sum_y p(y) \mathbb{H}(X|Y=y)$ is the conditional entropy. This way we can interpret the MI as the reduction in uncertainty about $X$ after observing $Y$. For continuous random variables, one can use a generalization of the MI called the maximal information coefficient (MIC).
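A minimal sketch computing the mutual information of two binary variables directly from a joint distribution table, using the KL form above (the joint probabilities are made up):

In [ ]:
# joint distribution p(X, Y) as a 2x2 table (hypothetical numbers)
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)   # marginal p(X)
p_y = p_xy.sum(axis=0)   # marginal p(Y)

# I(X;Y) = KL[p(X,Y) || p(X)p(Y)]
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(mi)   # in nats; zero only when X and Y are independent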